```
Call:
lm(formula = GPA ~ . - ID, data = FirstYearGPA)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.07412 -0.25827  0.05384  0.27675  0.85761 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.5268983  0.3487584   1.511  0.13235    
HSGPA         0.4932945  0.0745553   6.616 3.03e-10 ***
SATV          0.0005919  0.0003945   1.501  0.13498    
SATM          0.0000847  0.0004447   0.190  0.84912    
Male          0.0482478  0.0570277   0.846  0.39850    
HU            0.0161874  0.0039723   4.075 6.53e-05 ***
SS            0.0073370  0.0055635   1.319  0.18869    
FirstGen     -0.0743417  0.0887490  -0.838  0.40318    
White         0.1962316  0.0700182   2.803  0.00555 ** 
CollegeBound  0.0214530  0.1003350   0.214  0.83090    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3834 on 209 degrees of freedom
Multiple R-squared:  0.3496,  Adjusted R-squared:  0.3216 
F-statistic: 12.48 on 9 and 209 DF,  p-value: 8.674e-16
```
Some of those variables seem like they could be strongly related, which could cause problems for the model. Check whether collinearity is an issue for this model. To check this, look at:

* Correlations among variables
* Variance inflation factors (VIFs)
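As a reminder (this is the standard definition, not spelled out in the lab itself), the VIF for predictor $j$ comes from regressing that predictor on all of the other predictors:

$$
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
$$

where $R_j^2$ is the $R^2$ from that auxiliary regression. A VIF of 1 means predictor $j$ is uncorrelated with the other predictors; common rules of thumb flag VIFs above 5 or 10 as problematic.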
```
cor(FirstYearGPA)
                      GPA       HSGPA        SATV         SATM        Male
GPA           1.00000000  0.44688735  0.30431137  0.194343851  0.05284917
HSGPA         0.44688735  1.00000000  0.21032124  0.152839634 -0.09031714
SATV          0.30431137  0.21032124  1.00000000  0.526943819  0.14555703
SATM          0.19434385  0.15283963  0.52694382  1.000000000  0.37099167
Male          0.05284917 -0.09031714  0.14555703  0.370991668  1.00000000
HU            0.31465575  0.11603117  0.09874856 -0.009601549 -0.01884386
SS           -0.00356909 -0.01725443 -0.02646987 -0.087839974  0.03507603
FirstGen     -0.15657732  0.06418575 -0.25657713 -0.177387395 -0.07610526
White         0.28177214  0.04604668  0.36823365  0.259465227  0.07696022
CollegeBound -0.06302497 -0.20003903  0.06484473  0.039322063  0.09981773
ID           -0.18693938 -0.24169420 -0.09436226 -0.047312698  0.10121274
                       HU          SS     FirstGen       White CollegeBound
GPA           0.314655754 -0.00356909 -0.156577322  0.28177214  -0.06302497
HSGPA         0.116031169 -0.01725443  0.064185751  0.04604668  -0.20003903
SATV          0.098748556 -0.02646987 -0.256577125  0.36823365   0.06484473
SATM         -0.009601549 -0.08783997 -0.177387395  0.25946523   0.03932206
Male         -0.018843863  0.03507603 -0.076105261  0.07696022   0.09981773
HU            1.000000000 -0.30660787 -0.212565615  0.12593391  -0.02997230
SS           -0.306607866  1.00000000 -0.023663260  0.01673417  -0.03170649
FirstGen     -0.212565615 -0.02366326  1.000000000 -0.23790392  -0.11051100
White         0.125933908  0.01673417 -0.237903920  1.00000000  -0.02391156
CollegeBound -0.029972303 -0.03170649 -0.110511005 -0.02391156   1.00000000
ID           -0.091577232 -0.12965617  0.002725596 -0.01524922  -0.01025733
                       ID
GPA          -0.186939380
HSGPA        -0.241694202
SATV         -0.094362258
SATM         -0.047312698
Male          0.101212743
HU           -0.091577232
SS           -0.129656169
FirstGen      0.002725596
White        -0.015249216
CollegeBound -0.010257334
ID            1.000000000
```
```
library(car)
Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some

vif(m1)
       HSGPA         SATV         SATM         Male           HU           SS 
    1.158078     1.605182     1.660002     1.205737     1.221530     1.147867 
    FirstGen        White CollegeBound 
    1.186760     1.212049     1.073989 
```
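None of these VIFs looks alarming. As a quick sanity check (my own arithmetic, not part of the lab output), the largest VIF here, about 1.66 for `SATM`, can be inverted to recover the shared variance:

$$
R_j^2 = 1 - \frac{1}{\mathrm{VIF}_j} \approx 1 - \frac{1}{1.66} \approx 0.40
$$

so roughly 40% of the variance in `SATM` is shared with the other predictors, well below the usual rule-of-thumb cut-offs of VIF > 5 or 10.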
Based on the correlations between each predictor and the outcome (above), are there predictors that you think should be significant but aren’t? Why do you think they aren’t? (This is a bit of a philosophical question.)
Are there extreme cases that we might be concerned about? Check for extreme values in terms of:

* Predictors (leverage)
* Predicted values (discrepancy)
* Changes to predicted values (influence)
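With $k = 9$ predictors and $n = 219$ cases (values taken from the code in this lab), the leverage cut-offs used in the analysis work out to (my arithmetic):

$$
\frac{2(k+1)}{n} = \frac{20}{219} \approx 0.091, \qquad \frac{3(k+1)}{n} = \frac{30}{219} \approx 0.137
$$

since the mean hat value for a regression with an intercept is $\bar{h} = (k+1)/n$.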
Thirteen cases had high leverage, as measured by the hat values from the hat matrix with a cut-off of 2(k+1)/n. An additional seven cases had high discrepancy, as measured by the externally studentized residuals with a cut-off of ±2.
There were no cases with high influence, as measured by Cook's D with a cut-off of greater than 1. This indicates that no single case was substantially changing the overall predicted values.
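One standard way to write Cook's distance (the textbook formula; the lab computes it via broom's `.cooksd` column) makes this result easy to interpret:

$$
D_i = \frac{r_i^2}{k+1} \cdot \frac{h_i}{1 - h_i}
$$

where $r_i$ is the internally studentized residual and $h_i$ the leverage of case $i$. A case needs both a large residual and high leverage to be influential, which is why cases flagged for leverage or discrepancy alone can still leave every $D_i$ well below 1.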
Source Code
````markdown
---
title: "BTS 510 Lab 8"
format: 
  html:
    embed-resources: true
    self-contained-math: true
    html-math-method: katex
    number-sections: true
    toc: true
    code-tools: true
    code-block-bg: true
    code-block-border-left: "#31BAE9"
---

```{r}
#| label: setup
set.seed(12345)
library(tidyverse)
library(Stat2Data)
theme_set(theme_classic(base_size = 16))
```

## Learning objectives

* **Assess** models for **(multi)collinearity** among predictors
* **Conduct** outlier analyses to determine extreme and/or problematic cases

## Data

* `FirstYearGPA` data from the **Stat2Data** package: $n$ = 219 subjects
    * `GPA`: First-year college GPA on a 0.0 to 4.0 scale
    * `HSGPA`: High school GPA on a 0.0 to 4.0 scale
    * `SATV`: Verbal/critical reading SAT score
    * `SATM`: Math SAT score
    * `Male`: 1 = male, 0 = female
    * `HU`: Number of credit hours earned in humanities courses in high school
    * `SS`: Number of credit hours earned in social science courses in high school
    * `FirstGen`: 1 = student is the first in her or his family to attend college, 0 = otherwise
    * `White`: 1 = white students, 0 = others
    * `CollegeBound`: 1 = attended a high school where >= 50% of students intended to go on to college, 0 = otherwise

## Research question(s)

* How do **all these variables** impact first year GPA (`GPA`)?
    * Are there any problems with **collinearity** among the *predictors*?
    * Are there **problematic cases** that are *influencing the results*?

## Tasks

1. Run a **linear regression model** predicting `GPA` from *all other variables* in the dataset.

```{r}
library(Stat2Data)
data(FirstYearGPA)
FirstYearGPA <- FirstYearGPA %>%
  mutate(ID = row_number())
head(FirstYearGPA)
```

```{r}
m1 <- lm(data = FirstYearGPA, GPA ~ . - ID)
summary(m1)
```

2. Some of those variables seem like they could be **strongly related**, which could cause problems for the model. Check to see if **collinearity** is an issue for this model. To check this, look at:
    * **Correlations among variables**
    * **Variance inflation factors (VIFs)**

```{r}
cor(FirstYearGPA)
library(car)
vif(m1)
```

3. Based on the **correlations between each predictor and the outcome** (above), are there predictors that you think **should** be significant but aren't? Why do you think they aren't? (This is a bit of a philosophical question.)

4. Are there **extreme cases** that we might be concerned about? Check for extreme values in terms of:
    * **Predictors**
    * **Predicted values**
    * **Changes to predicted values**

```{r}
library(broom)
m1_aug <- augment(m1) %>%
  mutate(ID = FirstYearGPA$ID)  # ID is not in the model, so augment() drops it; add it back for labeling
m1_aug
ggplot(data = m1_aug, aes(x = c(1:nrow(m1_aug)), y = .hat)) +
  geom_point() +
  geom_hline(yintercept = 3*(9+1)/219, color = "red", linetype = "dashed") +
  geom_hline(yintercept = 2*(9+1)/219, color = "blue") +
  geom_text(aes(label = ifelse((.hat > 2*(9+1)/219), as.character(ID), '')),
            hjust = 0, nudge_x = 2)
```

```{r}
library(car)
m1_stresid <- FirstYearGPA %>%
  mutate(esresid = rstudent(m1))
head(m1_stresid)
ggplot(data = m1_stresid, aes(x = c(1:nrow(m1_stresid)), y = esresid)) +
  geom_point() +
  geom_hline(yintercept = 2, color = "blue", linetype = "dashed") +
  geom_hline(yintercept = -2, color = "blue", linetype = "dashed") +
  geom_text(aes(label = ifelse((esresid > 2), as.character(ID), '')),
            hjust = 0, nudge_x = 2) +
  geom_text(aes(label = ifelse((esresid < -2), as.character(ID), '')),
            hjust = 0, nudge_x = 2)
```

```{r}
ggplot(data = m1_aug, aes(x = c(1:nrow(m1_aug)), y = .cooksd)) +
  geom_point() +
  geom_hline(yintercept = 1, color = "red", linetype = "dashed") +
  geom_text(aes(label = ifelse((.cooksd > 1), as.character(ID), '')),
            hjust = 0, nudge_x = 2)
```

5. **Summarize** your findings about collinearity and outliers. *Use plain language.*

* Leverage

```{r}
m1_aug %>%
  filter(.hat > 2*(9+1)/219) %>%
  select(ID)
```

* Discrepancy

```{r}
m1_stresid %>%
  filter(esresid > 2) %>%
  select(ID)
m1_stresid %>%
  filter(esresid < -2) %>%
  select(ID)
```

* Influence

```{r}
m1_aug %>%
  filter(.cooksd > 1) %>%
  select(ID)
```

Thirteen cases had high leverage, as measured by the hat values from the hat matrix with a cut-off of $2(k+1)/n$. An additional seven cases had high discrepancy, as measured by the externally studentized residuals with a cut-off of $\pm 2$. There were no cases with high influence, as measured by Cook's D with a cut-off of greater than 1. This indicates that no single case was substantially changing the overall predicted values.
````